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Abstract 

The multicanonical method has been proven powerful for statistical investigations of lattice and off-lattice systems 
throughout the last two decades. We discuss an intuitive but very efficient parallel implementation of this algorithm 
and analyze its scaling properties for discrete energy systems, namely the Ising model and the 8-state Potts model. 
The parallelization relies on independent equilibrium simulations in each iteration with identical weights, merging their 
statistics in order to obtain estimates for the successive weights. With good care, this allows faster investigations of large 
systems, because it distributes the time-consuming weight-iteration procedure and allows parallel production runs. We 
show that the parallel implementation scales very well for the simple Ising model, while the performance of the 8-state 
Potts model, which exhibits a first-order phase transition, is limited due to emerging barriers and the resulting large 
integrated autocorrelation times. The quality of estimates in parallel production runs remains of the same order at same 
statistical cost. 
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1. Introduction 

Monte Carlo simulations are an important tool to in- 
vestigate a wide range of theoretical models with respect 
to their statistical properties such as phase transitions, 
structure formation and more. Throughout the last two 
decades, umbrella sampling algorithms like the multicanon- 
ical [U [5] or the Wang-Landau [3] algorithm have been 
proven to be very powerful for investigations of statistical 
phenomena, especially first-order phase transitions, for lat- 
tice and off-lattice models. They have been applied to a 
variety of systems with rugged free-energy landscapes in 
physics, chemistry and structural biology j4j. 

Due to the fact that computer performance increases 
mainly in terms of parallel processing on multi-core archi- 
tectures, a parallel implementation is of great interest, if 
the additional cores bring a benefit to the required simu- 
lation time. We present the scaling properties of a simple 
and straightforward parallelization of the multicanonical 
method, which has been reported in a similar way in [S] 
without much detail to the performance. This paralleliza- 
tion considers independent Markov chains, keeping com- 
munication to a minimum. Thus, it can be added on top 
of the multicanonical algorithm without much modifica- 
tion or system-dependent considerations and is also suit- 
able for systems with simple energy calculation. Similar 
to this parallelization, there have been previous reports for 
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the Wang-Landau algorithm [SI 'T' , which needed a little 
more adaption to the algorithm. 

2. Multicanonical Algorithm 

The multicanonical method allows to sample a system 
over a range of canonical ensembles at the same time. This 
is possible, because the statistical weights are modified in 
such a way that the simulation reaches each configuration 
energy of a chosen interval with equal probability, resulting 
in a flat energy histogram. To this end, the canonical 
partition function, in terms of the density of states i^{E), 
is modified in the following way: 
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In order that each energy state occurs with the same prob- 
ability, as requested above, the statistical weights have to 
equal the inverse density of states W{E) = il~^(£'). After 
an equilibrium simulation with those weights, it is possible 
to reweight to all canonical ensembles with a Boltzmann 
energy distribution covered by the flat histogram. This 
can be done for example by time-series reweighting, where 
in the average each measured observable is multiplied with 
its desired weight and divided by the weight with which it 
was measured: 



(2) 
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Of course, the density of states and consequently the 
weights that yield a flat energy histogram are not known 
in advance. Therefore the weights have to be obtained 
iteratively. In the most simple way consecutive weights 
are obtained from the last weights and the current energy 
histogram, W^'^+^^E) = W^'^\E) / H^'^^E). More sophis- 
ticated methods exist, where the full statistics of previous 
iterations is used for a stable and efficient approximation 
of the density of states . All our simulations use this re- 
cursive version with logarithmic weights in order to avoid 
numerical problems. 

Parallel Version 

The idea of this parallel implementation, similar to [S] , 
is to distribute the time consuming generation of statistics 
on p independent processes. All processes perform equilib- 
rium simulations with identical weig 

1, but with different random number seeds, resulting 
in similar but independent energy histograms hI^\e). 
The histograms are merged after each iteration and one 
ends up with H^'^-^E) = Y.,H-"\e). According to the 
weight modification of choice, the collected histogram is 
processed together with the previous weights in order to 
estimate the consecutive weights . The new weights 

are distributed onto all processes, which run equilibrium 
simulations again. That way, the computational effort may 
be distributed on several cores, allowing to generate the 
same amount of statistics in a fraction of the time. Impor- 
tant to notice is that a modification of the program only 
influences the histogram merging and the distribution of 
the new weights, see Fig. [l] The iterations are indepen- 
dent copies run in parallel and the weight modification is 
performed on the master process as in the non-parallelized 
case. 
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Figure 1: Scheme of the parallel implementation of the multicanon- 
ical algorithm on p cores. After each iteration with independent 
Markov chains but identical weights, the histograms are merged, the 
new weights are estimated and the weights are distributed onto all 
processes again. 



3. Systems and Implementation Issues 

We consider two discrete two-dimensional spin systems, 
namely the well known Ising model and the g-state Potts 
model with q = 8, where the Ising model can be mapped 



onto the q — 2 Potts model. The Ising model exhibits a 
second-order phase transition at /3o = In (l + V2) /2 and 
the 8-state Potts model exhibits a first-order phase transi- 
tion at /3o = In (l -I- ^/8) . The spins are located on a square 
lattice with side length L and interact only between near- 
est neighbors. In case of the Ising model, the interaction 
is described by the Hamiltonian 



^ (Ising) ^ 



(3) 



where J is the coupling constant and Si,Sj can take the val- 
ues {—1, 1}. For the g-state Potts model, where each site 
assumes values from {0, . . . ,q — 1}, the nearest-neighbors 
interaction is described by 



(Potts) 



(4) 



where 5{si,Sj) is the Kronecker-Delta function which is 
only non-zero in case Si = Sj. 

In those two cases the number of discrete energy states 
is equal to the number of lattice sites V = L^, such that 
the width of the energy range increases quickly with sys- 
tem size. The simulations in this study start at infinite 
temperature, i.e., /3 = 1/T = with quite narrow energy 
histograms. Because an estimation of successive weights is 
only possible for energies with non-zero histogram entries, 
the number of iterations may be very large for wide energy 
ranges. In order to ensure faster convergence, our imple- 
mentation includes after each estimation of weights a cor- 
rection function, which linearly interpolates the logarith- 
mic weights at the boundaries of the sampled region (with 
a range of L bins), allowing the next iteration to sample 
a larger energy region. The MUCA weights are converged 
if the last iteration covered the full energy range and all 
histogram entries are within half and twice the average his- 
togram entry. Between convergence of the weights and the 
final production run, the systems are thermalized again in 
order to yield correct estimates of the observables. In both 
cases, each sweep includes V number of spin updates. 

4. Performance and Scaling 

In order to estimate the performance and the speedup 
of the parallel algorithm appropriately, we performed the 
analysis in two steps. First, we estimated the optimal 
number of sweeps per iteration and core, which we will 
refer to as M. To this end, we performed parallel MUCA 
simulations over a wide range of M for different lattice 
sizes L and number of cores p. The simulations were ther- 
malized once in the beginning on every core, continuing 
the next iteration with the last state of the previous itera- 
tion on that core. This violates the equilibrium condition 
a little, as no intermediate thermalization phase was ap- 
plied and part of the iteration was needed to reach equi- 
librium. This is accepted in order to compare the perfor- 
mance equally without an additional parameter to opti- 
mize next to M. Furthermore, the physical results were 
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Figure 2: (left) Example for the estimation of -Mopt for the 8x8 Ising model with 1 and 8 cores. The minimum of the total statistics Nit^^Mp 
with respect to M yields Afopt- (right) Plot of Mopt versus the number of cores p used in the parallel MUCA simulation for the Ising (filled 
symbols) and the 8-state Potts (open symbols) model. The system sizes are 8x8 (red squares), 16 X 16 (green circles), 32 X 32 (blue triangles) 
and 64 X 64 (black diamonds). Fitted to the data points is the power-law dependence JSjI. 



not influenced, because each Markov chain was thermal- 
ized before the final production run. We determined the 
mean number of iterations until convergence to a flat his- 
togram iViter as the average over 10 simulations at different 
initial seeds. Plotting the total number of sweeps NucrMp 
versus M, we can estimate the optimal number of sweeps 
per iteration and core Mopt as the minimum of this func- 
tion [see Fig. [2f^left)]. For a small number of cores, this 
curve has a rather broad minimum, introducing a rough 
estimate. If, on the other hand, we stretch the curves 
along the a;-axis with the number of cores, the outcomes 
look quite similar. 

Selected results of the estimation of Afopt are shown in 
Fig. [2];right). We see that for different lattice sizes and 
spin models the dependence on the number of cores may 
be described by a 1/p power law, where the amphtudes 
seem to depend on the system size and the number of 
states (notice that the Potts model curves nearly coincide 
with those of the Ising model with 4 times system size) . In 
order to measure the performance equally, it is convenient 
to describe Mopt by a function of system size L and number 
of cores p, Mopt{L,p) « Mopt{L,p) = Mi{L)/p, where Mi 
is the interpolated optimal M for one core. Therefore, 
we estimated Mopt for the square lattice sizes 8, 16, 24, 
32, 48, 64, 96 (latter two only for Ising) with p < 32 and 
fitted for fixed size Mopt{L,p) = Mi{L)/p. The obtained 
Mi(L) were plotted over L and fitted with a power law 
(see Fig. [3]). In the end, the optimal number of sweeps per 
core and iteration were systematically described by the 
functions 
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Figure 3: The optimal number of sweeps for a single core Mi(L) 
together with its fit. Each data point is interpolated by fitting 
Afopt (p) = A/i/p to the estimated Afopt for p < 32. 



through energy space has to depend on the total system 
size and the number of spin states. This power-law behav- 
ior corresponds roughly to the scaling of the multicanoni- 
cal tunneling times in previous works [5J [T] . The explicit 
functional dependence ([s]) is characteristic for our specific 
implementation and will be the basis for our performance 
study (including larger lattice sizes). Interesting to no- 
tice is the prefactor ratio between 8-state Potts and Ising 
(which is a 2-Potts model) of about 4, corresponding to 
the increase in the number of spin states. 

Afterwards, we performed parallel MUCA simulations 
with different number of cores for each system size, using 
Mapt{L,p) in order to compare the optimal performance at 
each degree of parallelization. To this end, we considered 
the speedup for p cores, defined in terms of real time tp 
until convergence of the MUCA weights. 



which can be justified, considering that a random walk 
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Figure 4: Performance for the Ising model over different system sizes expressed by (left) the speedup factor and (right) the time-independent 
statistical speedup factor. 
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Figure 5: Performance for the 8-state Potts model over different system sizes expressed by (left) the speedup factor and (right) the time- 
independent statistical speedup factor. 
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Figure 6: Performance for the multimagnetic simuation of the Ising model at /9 = (3/2)/3o over different system sizes expressed by (left) the 
speedup factor and (right) the time-independent statistical speedup factor. 
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as well as the time-independent statistical speedup, de- 
fined in terms of total number of sweeps on each core until 
convergence Mtcr A^opt ( i , p) , 

^ [jViterMopt(i:,l)]l 
^ [iViterMopt(L,p)]p' ^' 

where the subscript indicates the number of cores used. In 
order to estimate the mean performance, we averaged the 
required time and number of sweeps over 32 independent 
PMUCA simulations for each data point. The results for 
the Ising model are shown in Fig. [4j revealing that the par- 
allel implementation of MUCA results in a linear speedup 
up to 64 cores already for system sizes L > 24. In case 
of small system sizes L = 8, the small Mopt leads to con- 
vergence of MUCA weights within milliseconds, which is 
difficult to measure precisely. In addition, it can be seen 
that the statistical speedup scales linearly for all system 
sizes investigated. This means that the total required num- 
ber of sweeps until convergence does not increase with the 
number of cores (compare also Fig. [2];ieft)). This indi- 
cates no slowdown of the parallelization other than from 
communication, which is kept to a minimum. 

In case of the 8-state Potts model, the performance 
does not scale as linearly with the number of cores, see 
Fig. [5j but is still satisfying. The reason for the drop in 
performance may be found in the first-order aspect [5] of 
the Potts model. In the transition from the disordered 
to an ordered phase the model undergoes a droplet-strip 
transition [TD]. Previous work on multimagnetic simula- 
tions of the Ising model [TT] showed that such a transition 
(and also the droplet-condensation transition) is accom- 
panied by "hidden" barriers, which are not directly re- 
flected in the multimagnetic or multicanonical histograms 
and hence are difficult to overcome. We applied the par- 
allel version of the multicanonical method also to a mul- 
timagnetic simulation of the Ising model at (3 — (3/2)/3o, 
determined Mopt{p) for the lattice sizes L — 8,16,24,32 
and measured the speedup factor. The result can be seen 
in Fig. |6] and shows the same drop in performance as we 
can observe for the 8-state Potts model. 

With this picture in mind, the drop in performance 
may be explained by the fact that with increasing p we 
decreased the number of sweeps per iteration and thus de- 
creased the chance to efficiently cross emerging barriers. 
When reducing the number of sweeps per iteration as a 
consequence of parallelization, this reaches a point where 
the number of sweeps are of the order of the integrated 
autocorrelation time r and each Markov chain is strongly 
correlated, see also Fig. [Tj Exemplary measurements of 
r in the 8-state Potts model for different lattice sizes re- 
vealed that it was of the order of Mopt {L, 8) to Mopt (-^, 16) , 
verifying the drop in performance for p > 16. This gives a 
limit of parallelization depending on the barriers and the 
associated autocorrelation times of the system. 

From a practical point of view, when simulating com- 
plex systems, it may be advantageous to introduce short 



^ T 

p=l 



p=4 r 1 

I — '- 1 

p=8 I * ^ ' I 

I — '- — I 

Figure 7: Scheme of the number of sweeps per iteration and core in 
comparison to the integrated autocorrelation time r. If the number 
of sweeps per core gets smaller than the integrated autocorrelation 
time of the Markov chain, the convergence of the MUCA weights 
gets worse and the performance drops. 

thermalization phases between iterations. Exemplary in- 
vestigations of the 8-state Potts model with intermediate 
thermalization showed that the number of iterations can 
be reduced significantly while introducing additional com- 
putation time. 

5. Quality 

The parallel MUCA weight recursion can be extended 
by a parallel production run, acquiring data with indepen- 
dent Markov chains. This allows equally good estimation 
of observables for a constant number of total measure- 
ments, if it is ensured that each Markov chain samples the 
desired phase space appropriately. For the Ising model, 
we considered the relative deviation from the exact result 
Oex |12| and averaged over a temperature range around 
the critical temperature 

m^wmj: ^"'%-°f-\ (8) 

In this case the range was chosen to be /3 G [3/10,6/10] 
with N — 300 steps. Figure |8] shows the average deviation 
of the specific heat {dCy) (Cy = (3^ {{E^) - (Ef) /V) for 
different sizes of the Ising model. The error was estimated 
by averaging over 16 independent runs. In each simula- 
tion there was an additional thermalization phase after the 
MUCA-weight convergence and the final production run 
was performed with 30 x Mopt(L,p) measurement points 
and 50 sweeps between measurements. The decrease of the 
relative deviation with increasing lattice size comes from 
this choice. From Fig. [sjright) it can be seen that, for a 
given system size, the relative deviations remain constant 
within the statistical error for all p. 

In order to verify the quality of the parallel simula- 
tion of the 8-state Potts model, we estimated the scaling 
of the order-disorder interface tension dod and compared 
it to analytic results [121 . The interface tension can be 
approximated in terms of the probability density of the 
histograms at the transition temperature |14j . 

aod = lim -5- In ( ) . (9) 
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Figure 8: (left) Reweighted specific heat from an Ising simulation with 64 cores and the exact results as well as (right) its relative deviation 
from the exact solution over the number of cores with constant total number of measurements. 
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Figure 9: (left) Reweighted normalized mean energy histograms and (right) the scaling of the order-disorder interface tension for the 8-state 
Potts model simulated on 64 cores. 



This requires to first find tlic temperature for wliich tlie 
reweigfited energy histogram shows two equally high peaks 
[see Fig.[9]jleft)] and then to estimate the ratio between the 
maximum and minimum. Figure [oj^right) shows a rough 
scaling of the order-disorder interface tension for several 
system sizes up to 96 x 96, simulated with 64 cores. The 
fit to the larger system sizes yields an infinite lattice limit 
Cod ~ 0.045, which is consistent with the exact result [13] 
and verifies that the parallel implementation yields correct 
results. 

6. Conclusion 

We presented a straightforward parallel implementa- 
tion of the multicanonical algorithm and showed that its 
scaling properties with the number of cores are very good 
for the Ising model and adequate for the 8-state Potts 
model. The latter one suffers from emerging barriers at the 
first-order phase transition, resulting in large integrated 



autocorrelation times. The parallelization profits from a 
minimal amount of communication because histograms are 
merged at the end of each iteration. This is the main rea- 
son why it is difficult to adapt this method to weight re- 
cursions of the Wang-Landau type where the weights are 
changed after each spin update. Since it would be interest- 
ing to find a similar approach for those weight recursions, 
not relying on shared memory, we are currently investi- 
gating this problem. The major advantage of the imple- 
mentation employed here lies in the fact, that no greater 
adjustment to the usual implementation is necessary and 
that additional modifications may be carried along. Thus, 
it can be easily generalized to complex systems, e.g. (spin) 
glasses or (bio) polymers, and allows good convergence if 
it is ensured that the number of sweeps per core is greater 
than the integrated autocorrelation times. 
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